Search for: All records

Creators/Authors contains: "Haudek, Kevin"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Free, publicly-accessible full text available April 23, 2026
  2. Mills, Caitlin; Alexandron, Giora; Taibi, Davide; Lo Bosco, Giosuè; Paquette, Luc (Ed.)
    Open-text responses provide researchers and educators with rich, nuanced insights that multiple-choice questions cannot capture. When reliably assessed, such responses have the potential to enhance teaching and learning. However, scaling and consistently capturing these nuances remain significant challenges, limiting the widespread use of open-text questions in educational research and assessments. In this paper, we introduce and evaluate GradeOpt, a unified multi-agent automatic short-answer grading (ASAG) framework that leverages large language models (LLMs) as graders for short-answer responses. More importantly, GradeOpt incorporates two additional LLM-based agents, the reflector and the refiner, into the multi-agent system. This enables GradeOpt to automatically optimize the original grading guidelines by performing self-reflection on its errors. To assess GradeOpt's effectiveness, we conducted experiments on two representative ASAG datasets, which include items designed to capture key aspects of teachers' pedagogical knowledge and students' learning progress. Our results demonstrate that GradeOpt consistently outperforms representative baselines in both grading accuracy and alignment with human evaluators across different knowledge domains. Finally, comprehensive ablation studies validate the contributions of GradeOpt's individual components, confirming their impact on overall performance. (An illustrative sketch of this grader-reflector-refiner loop appears after this list.)
    Free, publicly-accessible full text available July 12, 2026
  3. Abstract: We discuss transforming STEM education using three aspects: learning progressions (LPs), constructed response performance assessments, and artificial intelligence (AI). Using LPs to inform instruction, curriculum, and assessment design helps foster students' ability to apply content and practices to explain phenomena, which reflects deeper science understanding. To measure progress along these LPs, performance assessments combining elements of disciplinary ideas, crosscutting concepts, and practices are needed. However, these tasks are time-consuming and expensive to score and to provide feedback on. AI makes it possible to validate LPs and evaluate performance assessments for many students quickly and efficiently. The evaluation provides a report describing student progress along the LP and the supports needed to attain a higher LP level. We suggest using unsupervised and semi-supervised machine learning (ML) and generative AI (GAI) at early LP validation stages to identify relevant proficiency patterns and start building an LP. We further suggest employing supervised ML and GAI for developing targeted LP-aligned performance assessments for more accurate performance diagnosis at advanced LP validation stages. Finally, we discuss employing AI to design automatic feedback systems that provide personalized feedback to students and help teachers implement LP-based learning. We discuss the challenges of realizing these tasks and propose future research avenues. (A small illustrative sketch of clustering responses to surface proficiency patterns appears after this list.)
  4. The Framework for K-12 Science Education recognizes modeling as an essential practice for building deep understanding of science. Modeling assessments should measure the ability to integrate Disciplinary Core Ideas and Crosscutting Concepts. Machine learning (ML) has been utilized to score and provide feedback on open-ended Learning Progression (LP)-aligned assessments. Analytic rubrics have been shown to make it easier to evaluate the validity of ML-based scores. A possible drawback of using analytic rubrics is the potential for oversimplification of integrated ideas. We demonstrate the deconstruction of a 3D holistic rubric for modeling assessments aligned to an LP for Physical Science. We describe deconstructing this rubric into analytic categories for ML training while preserving its 3D nature.
  5. Free, publicly-accessible full text available December 1, 2025
  6. This article builds on the work of Scott et al. (Scott EE, Cerchiara J, McFarland JL, Wenderoth MP, Doherty JH. J Res Sci Teach 1: 37, 2023) and Shiroda et al. (Shiroda M, Fleming MP, Haudek KC. Front Educ 8: 989836, 2023) to quantitatively examine student language in written explanations of mass balance across six contexts using constructed response assessments. These results present an evaluation of student mass balance language and provide researchers and practitioners with tools to assist students in constructing scientific mass balance reasoning explanations. 
  7. We applied established ecology methods in a novel way to quantify and compare language diversity within a corpus of short written student texts. Constructed responses (CRs) are a common form of assessment but are difficult to evaluate using traditional methods of lexical diversity due to text length restrictions. Here, we examined the utility of ecological diversity measures and ordination techniques for quantifying differences in short texts by applying these methods, in parallel with traditional text analysis methods, to a corpus of previously studied college student CRs. The CRs were collected at two time points (Timing), from three types of higher-education institutions (Type), and across three levels of student understanding (Thinking). Based on previous work, we predicted that we would observe the greatest differences by Thinking, then by Timing, and no differences by Type, allowing us to test the utility of these methods for categorical examination of the corpus. We found that the ecological diversity metrics that compare CRs to each other (Whittaker's beta, species turnover, and Bray–Curtis dissimilarity) were informative and correlated well with our predicted differences among categories and with other text analysis methods. Other ecological measures, including Shannon's and Simpson's diversity, measure the diversity of language within a single CR. Additionally, ordination provided meaningful visual representations of the corpus by reducing complex word frequency matrices to two-dimensional graphs. Using the ordination graphs, we were able to observe patterns in the CR corpus that further supported our predictions for the data set. This work establishes novel approaches to measuring language diversity within short texts that can be used to examine differences in student language and possible associations with categorical data. (A brief sketch of these diversity measures applied to word counts appears after this list.)
  8. Gardner, Stephanie (Ed.)
    This paper details the development of the first reasoning framework to describe how students' reasoning about biological bulk flow pressure gradients develops toward scientific, mechanistic reasoning.
  9. These findings are the first empirical evidence to support the claim that using Physiology Core Concept reasoning supports transfer of knowledge across different physiological systems. 
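Entry 2 describes GradeOpt's grader, reflector, and refiner agents but not how they are wired together. The sketch below shows one way such a reflect-and-refine loop could be assembled; the ask_llm placeholder, the prompt wording, the exact-match agreement check, and the round limit are illustrative assumptions, not the authors' implementation.

```python
"""Illustrative sketch of a reflect-and-refine grading loop (not the GradeOpt code)."""

from dataclasses import dataclass


def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM client call; replace with your own."""
    raise NotImplementedError("plug in an LLM client here")


@dataclass
class LabeledResponse:
    text: str
    human_score: str  # e.g. "0", "1", "2" on a rubric


def grade(guidelines: str, response: str) -> str:
    """Grader agent: score one response against the current guidelines."""
    return ask_llm(f"Rubric:\n{guidelines}\n\nResponse:\n{response}\n\nScore:")


def reflect(guidelines: str, errors: list[tuple[LabeledResponse, str]]) -> str:
    """Reflector agent: diagnose why the grader disagreed with human scores."""
    report = "\n".join(
        f"- response: {r.text!r}; human: {r.human_score}; model: {pred}"
        for r, pred in errors
    )
    return ask_llm(
        f"Rubric:\n{guidelines}\n\nMisgraded cases:\n{report}\n\n"
        "Explain what in the rubric caused these errors."
    )


def refine(guidelines: str, reflection: str) -> str:
    """Refiner agent: rewrite the guidelines to address the diagnosis."""
    return ask_llm(
        f"Rubric:\n{guidelines}\n\nDiagnosis:\n{reflection}\n\n"
        "Rewrite the rubric to address the diagnosis."
    )


def optimize_guidelines(guidelines: str, dev_set: list[LabeledResponse],
                        max_rounds: int = 3) -> str:
    """Grade a labeled dev set, then revise the rubric on the errors, repeatedly."""
    for _ in range(max_rounds):
        predictions = [(r, grade(guidelines, r.text)) for r in dev_set]
        errors = [(r, p) for r, p in predictions if p.strip() != r.human_score]
        if not errors:  # full agreement with human scores: stop early
            break
        guidelines = refine(guidelines, reflect(guidelines, errors))
    return guidelines
```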
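Entry 3 suggests unsupervised machine learning as a way to surface proficiency patterns from constructed responses at early LP validation stages. As a rough illustration only, the sketch below clusters a few toy responses with TF-IDF features and k-means (assuming scikit-learn is available); the example responses, the feature choice, and the cluster count are invented for the illustration and do not reproduce the authors' analysis.

```python
"""Sketch: cluster constructed responses to surface candidate proficiency patterns."""

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy responses standing in for a real corpus of student explanations.
responses = [
    "the ice melts because heat energy transfers from the water",
    "the ice melts because it is cold",
    "energy moves from the warmer water to the colder ice until equilibrium",
    "the water makes the ice go away",
]

# Represent each response as a TF-IDF vector over its words.
X = TfidfVectorizer().fit_transform(responses)

# Group similar responses; a researcher then inspects each cluster for the kind
# of reasoning it contains and uses those patterns to draft tentative LP levels.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for label, text in sorted(zip(labels, responses)):
    print(label, text)
```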
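Entry 7 names Shannon's and Simpson's diversity and Bray–Curtis dissimilarity as ecological measures applied to the language of constructed responses. The sketch below computes these measures from raw word counts using their standard formulas; the whitespace tokenizer and the two toy responses are assumptions, and the study's actual preprocessing and ordination steps are not reproduced here.

```python
"""Sketch: ecological diversity measures computed over word counts in short texts."""

import math
from collections import Counter


def word_counts(text: str) -> Counter:
    """Treat each distinct word as a 'species' and count its occurrences."""
    return Counter(text.lower().split())


def shannon(counts: Counter) -> float:
    """Shannon diversity: H = -sum(p_i * ln p_i)."""
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def simpson(counts: Counter) -> float:
    """Simpson's diversity in the common 1 - sum(p_i^2) form."""
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def bray_curtis(a: Counter, b: Counter) -> float:
    """Bray–Curtis dissimilarity between two word-count 'communities'."""
    vocab = set(a) | set(b)
    numerator = sum(abs(a[w] - b[w]) for w in vocab)
    denominator = sum(a[w] + b[w] for w in vocab)
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    r1 = word_counts("water moves down the pressure gradient into the cell")
    r2 = word_counts("the cell swells because water enters the cell")
    print(f"Shannon:     {shannon(r1):.3f} vs {shannon(r2):.3f}")
    print(f"Simpson:     {simpson(r1):.3f} vs {simpson(r2):.3f}")
    print(f"Bray-Curtis: {bray_curtis(r1, r2):.3f}")
```

Shannon's and Simpson's indices describe the diversity of language within a single response, while Bray–Curtis values near 0 indicate two responses with very similar word usage and values near 1 indicate largely disjoint vocabularies.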